Abstract

A degenerate symbol x̃ over an alphabet Σ is a non-empty subset of
Σ, and a sequence of such symbols is a degenerate string. A degenerate
string is said to be conservative if its number of non-solid symbols is
upper-bounded by a fixed positive constant k. We consider here the
matching problem of conservative degenerate strings and present the
first linear-time algorithm that can find, for given degenerate strings
P̃ and T̃ of total length n containing k non-solid symbols in total, the
occurrences of P̃ in T̃ in O(nk) time.


1 Introduction 

Degenerate, or indeterminate, strings are found in Biology, Musicology and 
Cryptography. They are defined by the occurrence of one or more positions 
which are represented by sets of symbols. In conservative degenerate strings, 
the number of such occurrences is bounded by k. In music, single notes may 
match chords. In encrypted and biological sequences, a position in one string 
may match exactly with various symbols in other strings. 

Previous algorithmic research on degenerate strings has focused on pattern
matching. Pattern matching in degenerate strings is particularly relevant
in the context of coding biological sequences. Due to the degeneracy of
the genetic code, two dissimilar DNA sequences can be translated into two
identical protein sequences. Without taking this degeneracy into account,
many associations between biological entities can be overlooked. For
example, the following six DNA codons are all translated into the amino
acid Leucine: TTA, TTG, CTT, CTC, CTA and CTG. This example highlights
the significance of solving problems relating to degeneracy in strings. In
fact, special symbols to represent sets of DNA symbols have long been
established by the IUPAC-IUBMB Biochemical Nomenclature Committee [1]. For
example, R represents any purine (A or G), Y represents any pyrimidine (C,
T or U) and N represents any nucleotide. An example of the practical
implications of such research is the design of primers for cloning DNA
sequences using PCR (Polymerase Chain Reaction). Degenerate primers are
used when their design is based on protein sequences, which can be
reverse-translated to many different DNA sequences, their number growing
with the length n of the sequence.

This paper introduces an algorithm which is a significant improvement on
those published previously. The first significant contribution to the
problem of pattern matching of degenerate strings was made in 1974 [2],
and was later improved [3]. Later still, faster algorithms for the same
problem were proposed [4, 5]. Since then, many practical methods have been
suggested [6-8], as well as variations of the problem. For example, a
non-practical generalised string matching algorithm was introduced by
Abrahamson in 1987 [9]. Most recently, Crochemore et al. [10] reported an
algorithm to find the shortest solid cover in a degenerate string with
exponential time complexity. We report here a major improvement in time:
O(kn). Beyond pattern matching, the linear algorithm reported here can be
applied to many different problems, including finding cover and prefix
arrays.

The rest of the paper is organised as follows. The next section
introduces the vocabulary and the notions that will be used in this paper.
Section 3 formally defines the problem and presents the algorithm we have
proposed. The algorithm is analysed in Section 4 and, finally, Section 5
concludes the paper.


2 Preliminaries


To provide an overview of our results we begin with a few definitions,
generally following [8, 10]. An alphabet Σ is a non-empty finite set of
symbols of size |Σ|. A string over a given alphabet is a finite sequence
of symbols. The length of a string x is denoted by |x|. The empty string
is denoted by ε. The set of all strings over an alphabet Σ (including the
empty string ε) is denoted by Σ*.

A degenerate symbol x̃ over an alphabet Σ is a non-empty subset of Σ, i.e.,
x̃ ⊆ Σ and x̃ ≠ ∅. |x̃| denotes the size of the set and we have
1 ≤ |x̃| ≤ |Σ|. A finite sequence X̃ = x̃_1 x̃_2 ... x̃_n is said to be a
degenerate string if x̃_i is a degenerate symbol for each i from 1 to n. In
other words, a degenerate string is built over the potential 2^|Σ| - 1
non-empty sets of letters belonging to Σ. The number of degenerate
symbols, n here, in a degenerate string X̃ is its length, denoted by |X̃|.
For example, X̃ = [a,b][a][c][b,c][a][a,b,c] is a degenerate string of
length 6 over Σ = {a,b,c}. If |x̃_i| = 1, that is, x̃_i represents a single
symbol of Σ, we say that x̃_i is a solid symbol and i is a solid position.
Otherwise x̃_i and i are said to be a non-solid symbol and a non-solid
position respectively. For convenience we often write x̃_i = c (c ∈ Σ),
instead of x̃_i = [c], in the case of solid symbols. Consequently, the
degenerate string X̃ of the previous example will be written as
[a,b]ac[b,c]a[a,b,c]. A string containing only solid symbols will be
called a solid string. Also, as a convention, capital letters will be used
to denote strings while small letters will be used for representing
symbols. Furthermore, degeneracy will be indicated by a tilde: for
example, X̃ denotes a degenerate string while a plain letter like X
represents a solid string. The empty degenerate string is denoted by ε̃.

A conservative degenerate string is a degenerate string whose number of
non-solid symbols is upper-bounded by a fixed positive constant k. The
concatenation of degenerate strings X̃ and Ỹ is X̃Ỹ. A degenerate string Ṽ
is a substring (resp. prefix, suffix) of a degenerate string X̃ if
X̃ = ŨṼW̃ (resp. X̃ = ṼW̃, X̃ = ŨṼ) for some degenerate strings Ũ and W̃. By
X̃[i..j], we represent the substring x̃_i x̃_{i+1} ... x̃_j of X̃.

For degenerate strings, the notion of symbol equality is extended to a
single-symbol match between two degenerate symbols in the following way.
Two degenerate symbols x̃ and ỹ are said to match (represented as x̃ ≈ ỹ) if
x̃ ∩ ỹ ≠ ∅. Extending this notion to degenerate strings, we say that two
degenerate strings X̃ and Ỹ match (denoted as X̃ ≈ Ỹ) if |X̃| = |Ỹ| and
corresponding symbols in X̃ and Ỹ match, i.e., for each i = 1, ..., |X̃| we
have x̃_i ≈ ỹ_i. Note that the relation ≈ is not transitive. A degenerate
string X̃ is said to occur at position i in another degenerate (resp.
solid) string Ỹ (resp. Y) if X̃ ≈ Ỹ[i..i + |X̃| - 1] (resp.
X̃ ≈ Y[i..i + |X̃| - 1]).
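
The definitions above can be illustrated with a minimal Python sketch. It
assumes that a degenerate symbol is modelled as a frozenset of
single-character letters and that strings are given in the bracket
notation used in this section; the helper names (parse_degenerate,
symbols_match, strings_match, occurs_at) are hypothetical and introduced
only for illustration.

```python
from typing import FrozenSet, List, Sequence

DegSymbol = FrozenSet[str]  # a non-empty set of letters


def parse_degenerate(text: str) -> List[DegSymbol]:
    """Parse bracket notation such as '[a,b]ac[b,c]' into a list of symbol sets."""
    symbols: List[DegSymbol] = []
    i = 0
    while i < len(text):
        if text[i] == "[":
            close = text.index("]", i)  # raises ValueError if the bracket is unclosed
            letters = text[i + 1:close].replace(",", "").replace(" ", "")
            if not letters:
                raise ValueError("a degenerate symbol must be a non-empty set")
            symbols.append(frozenset(letters))
            i = close + 1
        else:
            symbols.append(frozenset(text[i]))  # solid symbol written without brackets
            i += 1
    return symbols


def symbols_match(x: DegSymbol, y: DegSymbol) -> bool:
    """Two degenerate symbols match if their intersection is non-empty."""
    return not x.isdisjoint(y)


def strings_match(X: Sequence[DegSymbol], Y: Sequence[DegSymbol]) -> bool:
    """Two degenerate strings match if they have equal length and match symbol-wise."""
    return len(X) == len(Y) and all(symbols_match(x, y) for x, y in zip(X, Y))


def occurs_at(X: Sequence[DegSymbol], Y: Sequence[DegSymbol], i: int) -> bool:
    """True if X occurs at 1-based position i of Y."""
    if i < 1 or i + len(X) - 1 > len(Y):
        return False
    return strings_match(X, Y[i - 1:i - 1 + len(X)])
```

The non-transitivity of the match relation is visible with three symbols:
symbols_match(frozenset("a"), frozenset("ab")) and
symbols_match(frozenset("ab"), frozenset("b")) both hold, while
symbols_match(frozenset("a"), frozenset("b")) does not.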


3 Conservative Degenerate String Matching 

Problem 1. Given a conservative degenerate pattern P̃ with k non-solid
symbols, and a solid text T, find all positions in T at which P̃ occurs.

Example 1. We consider a degenerate pattern P̃ = a[bc]da[bd] with k = 2
and a text T = dacdabdadcabdac. Table 1 shows that P̃ occurs in T at
positions 2 and 5.






Table 1: Occurrence of P̃ in T

For convenience, we compute a table Pre[k, |Σ|] such that for the i-th
non-solid position of P̃ (1 ≤ i ≤ k) and each letter a ∈ Σ, we have
Pre[i, a] = 1 if a belongs to the non-solid symbol at that position and 0
otherwise. After such O(k|Σ|)-time preprocessing, we can check in O(1)
time whether a non-solid position in P̃ matches a position in T or not.
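
As an illustration of this preprocessing, the following Python sketch
builds such a table for a pattern given as a list of letter sets. The
function names, the row_of and col lookup dictionaries, and the 0-based
row numbering are assumptions made for the example and do not come from
the paper.

```python
from typing import Dict, List, Sequence, Set, Tuple


def build_pre(
    pattern: Sequence[Set[str]], alphabet: Sequence[str]
) -> Tuple[List[List[int]], Dict[int, int], Dict[str, int]]:
    """Build Pre with one row per non-solid position and one column per letter.

    Returns (pre, row_of, col) where row_of maps a 1-based pattern position
    to its row in pre, and col maps a letter to its column.
    """
    col = {letter: c for c, letter in enumerate(alphabet)}
    pre: List[List[int]] = []
    row_of: Dict[int, int] = {}
    for pos, symbol in enumerate(pattern, start=1):
        if len(symbol) > 1:  # non-solid position
            row_of[pos] = len(pre)
            pre.append([1 if letter in symbol else 0 for letter in alphabet])
    return pre, row_of, col


def nonsolid_matches(
    pre: List[List[int]],
    row_of: Dict[int, int],
    col: Dict[str, int],
    pos: int,
    letter: str,
) -> bool:
    """Constant-time test: does the non-solid symbol at position pos contain letter?"""
    return pre[row_of[pos]][col[letter]] == 1
```

For the pattern of Example 1 over the alphabet abcd, the table has two
rows, one for position 2 ([bc]) and one for position 5 ([bd]), so the
total work is proportional to k|Σ|.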

An Outline of Our Approach 

Our algorithm to solve Problem 1 is built on top of an adapted version
of the sequential algorithm presented by Landau and Vishkin to find all
occurrences of a (solid) pattern P of length m in a (solid) text T of
length n with at most e differences each [11], where a difference can be
due to either a mismatch between the corresponding characters of the text
and the pattern, or a superfluous character in the text, or a superfluous
character in the pattern. The modification required for our strategy is to
treat only mismatches as the differences in Landau and Vishkin's
algorithm. Along the lines of the original algorithm of Landau and
Vishkin, the modified one works in the following two steps.

Step 1: Compute the suffix tree of the string obtained after concatenating
the text, the pattern and a character which is not present in Σ ∪ Λ,
i.e. TP#, using the serial algorithm of Weiner [12].

Step 2: Let Mismatch_{i,j} be the position in the pattern at which we have
the j-th mismatch (when defined) between T[i+1..i+m] and P[1..m]. In other
words, Mismatch_{i,j} = f represents the j-th mismatch from left to right
and implies that t_{i+f} ≠ p_f. In this step, we find Mismatch_{i,j} for
each i and j such that 0 ≤ i ≤ n - m and 1 ≤ j ≤ c + 1, where c denotes
the minimum of the two: e and the total number of mismatches between
T[i+1..i+m] and P[1..m]. If some Mismatch_{i,j} = m + 1, it signifies that
there is an occurrence of the pattern in the text, starting at t[i+1],
with at most e mismatches. Mismatch_{i,j} can be computed from
Mismatch_{i,j-1} as follows:

Let LCA_{s_i,s_j} be the lowest common ancestor (in short LCA) of the
leaves of the suffixes T[s_i+1..n] and P[s_j+1..m] in the suffix tree, and
let |LCA_{s_i,s_j}| denote its length. Mismatch_{i,j-1} = f implies that
T[i+1..i+f] and P[1..f] are matched with j - 1 mismatches. We want to
find the largest q such that T[i+f+1..i+f+q] = P[f+1..f+q] and
t_{i+f+q+1} ≠ p_{f+q+1}, so that Mismatch_{i,j} = f + q + 1. The desired q
is the same as the length of LCA_{i+f,f}. Thus,
Mismatch_{i,j} = f + |LCA_{i+f,f}| + 1.
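
The jump from one mismatch to the next in Step 2 can be sketched in Python
as follows. This is only an illustration under a stated simplification:
the longest common extension is computed by direct character comparison
instead of by a constant-time LCA query on the suffix tree, so the sketch
does not achieve the time bound of Landau and Vishkin. The function names
lce, mismatch_row and occurrences_with_mismatches are hypothetical.

```python
from typing import List


def lce(text: str, ti: int, pattern: str, pj: int) -> int:
    """Length of the longest common prefix of text[ti:] and pattern[pj:] (0-based).

    Naive stand-in for |LCA| on the suffix tree of text + pattern + '#'.
    """
    q = 0
    while (ti + q < len(text) and pj + q < len(pattern)
           and text[ti + q] == pattern[pj + q]):
        q += 1
    return q


def mismatch_row(text: str, pattern: str, i: int, e: int) -> List[int]:
    """Return Mismatch[i, 1], Mismatch[i, 2], ... for shift i (0 <= i <= n - m).

    Entries are 1-based pattern positions; the value m + 1 means that the end
    of the pattern was reached. At most e + 1 entries are computed.
    """
    m = len(pattern)
    row: List[int] = []
    f = 0  # position of the previous mismatch (0 before the first one)
    for _ in range(e + 1):
        f = f + lce(text, i + f, pattern, f) + 1
        row.append(f)
        if f == m + 1:
            break
    return row


def occurrences_with_mismatches(text: str, pattern: str, e: int) -> List[int]:
    """Shifts i at which pattern matches text[i:i+m] with at most e mismatches."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1)
            if mismatch_row(text, pattern, i, e)[-1] == m + 1]
```

For instance, with text abcabd, pattern abd and e = 1, shift 0 yields the
row [3, 4] (one mismatch at position 3, then the end of the pattern) and
shift 3 yields [4] (no mismatch), so both shifts are reported.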

Pseudocode for our approach is given as Algorithm 1. It works in the
following three stages:

Stage 1: Substitute 

In the first stage, each of the non-solid symbols occurring in the given
degenerate pattern is replaced by a unique symbol which is not present in
Σ. Λ represents the set of these unique symbols, i.e. Λ = {λ_i} such that
1 ≤ i ≤ k. Note that the pattern P_λ obtained by such a substitution is a
solid string. For example, P_λ obtained from P̃ in Example 1 is given in
Table 2.

Definition 3.1. We define λ positions as the positions in P_λ which
contain some λ_i ∈ Λ. Note that these are the same as the non-solid
positions in P̃.

Table 2: [Stage 1: Substitute] P_λ obtained from P̃

    P̃      a   [bc]   d   a   [bd]
    P_λ    a   λ_1    d   a   λ_2
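
The substitution stage can be sketched in Python as below. The sketch
assumes that the pattern is given as a list of letter sets and that
characters from the Unicode Private Use Area (starting at U+E000) play the
role of the symbols λ_i, on the assumption that they do not occur in Σ;
the function name substitute is hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def substitute(pattern: Sequence[Set[str]], alphabet: Set[str]) -> Tuple[str, List[int]]:
    """Replace every non-solid symbol by a fresh symbol outside the alphabet.

    Returns the solid string P_lambda and the 1-based list of lambda positions.
    """
    solid: List[str] = []
    lam_positions: List[int] = []
    for pos, symbol in enumerate(pattern, start=1):
        if len(symbol) == 1:
            solid.append(next(iter(symbol)))  # solid symbol is copied unchanged
        else:
            lam = chr(0xE000 + len(lam_positions))  # assumed fresh symbol lambda_i
            if lam in alphabet:
                raise ValueError("substitute symbol already occurs in the alphabet")
            solid.append(lam)
            lam_positions.append(pos)
    return "".join(solid), lam_positions
```

Applied to the pattern of Example 1, the sketch returns a solid string of
length 5 whose second and fifth characters are the two fresh symbols,
together with the λ positions [2, 5].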


Stage 2: Approximate Pattern Search 

The next stage consists of using the modified algorithm of Landau and
Vishkin to search for the pattern P_λ (solid) in the text T (solid) with
at most k mismatches in each occurrence. First, a suffix tree for the
(solid) string TP_λ# is constructed. Then, LCA queries on this suffix tree
are used to compute Mismatch_{i,j} for each i and j such that
0 ≤ i ≤ n - m and 1 ≤ j ≤ k + 1. As explained in Remark 3.1, j will vary
up to k + 1 in the case of P_λ. Every i such that Mismatch_{i,k+1} = m + 1
marks the beginning of an occurrence of P_λ in T (at i + 1) and is thus
added to the set ApproximateMatch.

Figure 1 shows the suffix tree for the string obtained by concatenating T
from Example 1 and P_λ from the previous step, i.e. TP_λ#, which is
dacdabdadcabdacaλ_1daλ_2#. Note that each node of the suffix tree is
stored as a pair (start, length) that represents the contiguous substring
S[start + 1..start + length], where S denotes the concatenated string. In
addition, a leaf node indicates the suffix it represents: a leaf node
showing i denotes the suffix S[i + 1..|S|]. Table 3 shows the resultant
Mismatch[0..n - m, 1..k + 1] array. This table provides the positions in
P_λ where it mismatches with the corresponding character in T. For
example, Mismatch[7, 1] = 2 denotes that the first mismatch between
T[8..12] and P_λ occurs at position 2 in P_λ, and rightly so as
t[7 + 2] = t[9] (i.e. d) does not match P_λ[2] (i.e. λ_1). As P_λ occurs
in T with at most 2 mismatches at locations 2, 5 and 11 (rows 1, 4 and 10
contain 6, i.e. m + 1), ApproximateMatch = [1, 4, 10].

Remark 3.1. There will always be a mismatch between P_λ and T at λ
positions, as each λ_i ∈ Λ is unique and does not occur in Σ and hence
not in T. As there are k λ positions, there are bound to be at least k
mismatches for each position i in the text at which the pattern is
matched. More explicitly, each occurrence recorded in ApproximateMatch has
exactly k mismatches.

Figure 1: [Stage 2: Approximate Pattern Search] Suffix tree for TP_λ#



Table 3: [Stage 2: Approximate Pattern Search] Mismatch array

Stage 3: Filter

An occurrence in ApproximateMatch reports a mismatch at a λ position
even if there is in reality a match at the corresponding non-solid
position. For example, if some λ_i has been substituted at a non-solid
position containing, say, [b,c], and the corresponding symbol in T is c,
clearly it is a match, but that position will be recorded as a 'mismatch'
in the array Mismatch because λ_i does not match c. Thus, each of the k
mismatches found in an occurrence of the solid P_λ in T identified by
ApproximateMatch in the preceding step can be seen as either real or fake
when considered with respect to the match of the degenerate pattern P̃ and
T.

Definition 3.2. A mismatch at a position, say e = Mismatch[i, j], is real
if the corresponding symbols in the degenerate pattern P̃ and the text T
mismatch, i.e. t[i + e] ≉ p̃[e]. Otherwise, the mismatch is fake.

Remark 3.2. A mismatch at a solid position will always be real while one
at a λ position can be either real or fake.

Definition 3.3. An approximate occurrence is an occurrence of P_λ in T with
k mismatches, whereas an occurrence of P̃ in T with an exact match is called
an exact occurrence.

Remark 3.3. It follows from Remarks 3.1 and 3.2 that if there is a mismatch
at even a single solid position, the total number of mismatches will exceed
k and such an occurrence will not figure as an approximate occurrence.
Conversely, an approximate occurrence will have mismatches only at λ
positions.

For each location i in the text where an approximate occurrence of P_λ has
been found (i ∈ ApproximateMatch), each position of mismatch (the λ
positions) in the pattern is checked to determine whether the mismatch is
real or not. If an approximate occurrence of the pattern P_λ in the text T
contains a real mismatch, it cannot represent an exact occurrence of P̃,
whereas an approximate occurrence containing only fake mismatches is the
same as an exact occurrence. The set of all such exact occurrences is the
solution to Problem 1. This step, therefore, filters out and discards the
approximate occurrences with real mismatches.

Table 4 illustrates this stage for the example being considered. With the
values given by ApproximateMatch = [1, 4, 10] from the previous stage, we
test each λ position from Λ = [2, 5] to check if the mismatch is real or
fake. At the first λ position (i.e. 2), t[1 + 2] (i.e. c) matches p̃[2]
(i.e. [b,c]), thus the mismatch is fake. The mismatch for the second λ
position (i.e. 5) is also fake owing to the fact that t[6] ≈ p̃[5].
Therefore, location 2 is recorded as an occurrence of an exact match of P̃
in T. The case of location 5 (i.e. value 4) is similar. But for value 10,
even though the first mismatch is fake (t[12] (i.e. b) ≈ p̃[2] (i.e.
[b,c])), the fact that t[15] (i.e. c) ≉ p̃[5] (i.e. [b,d]) makes the
second mismatch real. Therefore, location 11 (i.e. value 10) is discarded,
and thus the correct solution to Example 1 is obtained.



Table 4: [Stage 3: Filter] Checking mismatches in approximate occurrences
of P_λ in T
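
The filtering stage can be sketched in Python as follows, assuming the
degenerate pattern is given as a list of letter sets, the approximate
occurrences as 0-based shifts and the λ positions as 1-based pattern
positions. Membership in a Python set stands in for the constant-time
lookup in the table Pre; the function name filter_occurrences is
hypothetical.

```python
from typing import List, Sequence, Set


def filter_occurrences(
    text: str,
    pattern: Sequence[Set[str]],
    approximate_match: Sequence[int],
    lam_positions: Sequence[int],
) -> List[int]:
    """Keep the approximate occurrences whose mismatches are all fake.

    Returns the 1-based positions of the exact occurrences.
    """
    occ: List[int] = []
    for i in approximate_match:
        all_fake = True
        for e in lam_positions:
            # t[i + e] in the 1-based notation of the text is text[i + e - 1] here
            if text[i + e - 1] not in pattern[e - 1]:
                all_fake = False  # real mismatch
                break
        if all_fake:
            occ.append(i + 1)
    return occ
```

With the text and pattern of Example 1, ApproximateMatch = [1, 4, 10] and
λ positions [2, 5], the sketch keeps the shifts 1 and 4 and returns the
positions [2, 5], in agreement with the discussion above.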



4 Algorithm Analysis

Theorem 4.1. Algorithm 1 correctly computes all occurrences of P̃ in T in
O(kn) time.

Proof. Landau and Vishkin's algorithm correctly finds all occurrences of
P_λ in T with at most k mismatches in O(kn) time for a fixed alphabet. P_λ
differs from P̃ only at the λ positions, which are k in number. In
addition, each of the λ positions causes a mismatch. Notably, an exact
occurrence of P̃ in T will be given by an approximate occurrence of P_λ in
T with mismatches only at λ positions, and all of these mismatches must be
fake. All such occurrences, where mismatches occur only at the k λ
positions, are guaranteed to be captured by the approximate occurrences
given in ApproximateMatch. Also, as a consequence of Remark 3.3, an
approximate occurrence (for which the number of mismatches is at most k)
will never have a mismatch at any solid position. The filtering stage
checks each of the mismatches in an approximate occurrence and, if all of
these mismatches are found to be fake, we have an exact occurrence. Thus,
at the end of the filtering stage, we have all the exact occurrences and
only those.

The substitution stage can be performed in O(n) time. As mentioned
previously, the approximate pattern-search stage using the modified
algorithm of Landau and Vishkin computes ApproximateMatch in O(kn) time
for a fixed-sized alphabet, as the suffix tree is constructed in linear
time with respect to the size of the input string (n + m) and the
computation of the Mismatch array (lines 5 to 14) takes O(kn) time. The
filtering stage, in the worst case (ApproximateMatch contains 0 to n - m),
needs to process each location in T and to check whether the mismatch at
every λ position is real or fake. This check can be performed in constant
time after O(k|Σ|)-time pre-processing as mentioned earlier, which yields
an O(kn) time requirement for this stage. Thus, in
O(k|Σ| + n + kn + kn) = O(kn) time Algorithm 1 correctly computes all
occurrences of P̃ in T. □

Algorithm 1 Conservative Degenerate String Matching Algorithm

Input: Pattern P̃ of length m,
       Text T of length n,
       Number of non-solid symbols k
Output: The set of indices of T where P̃ occurs in T

    ▷ Substitute:
 1: Λ ← {λ_i | λ_i ∉ Σ and 1 ≤ i ≤ k}
 2: P_λ ← string obtained after substituting the i-th non-solid symbol in P̃
    with λ_i ∈ Λ, for all i such that 1 ≤ i ≤ k
    ▷ Approximate Pattern Search:
 3: Build the suffix tree for the string TP_λ#
 4: ApproximateMatch ← ∅
 5: for i ← 0 to n - m do
 6:     f ← 0
 7:     for j ← 1 to k + 1 do
 8:         Mismatch[i, j] ← f + |LCA_{i+f,f}| + 1
 9:         f ← Mismatch[i, j]
10:     end for
11:     if Mismatch[i, k + 1] = m + 1 then    *** approximate occurrence found ***
12:         Add i to ApproximateMatch
13:     end if
14: end for
    ▷ Filter:
15: Occ ← ∅
16: for each i ∈ ApproximateMatch do
17:     flagAllFake ← true
18:     for each λ position e ∈ Λ do
19:         if t[i + e] ≉ p̃[e] then
20:             flagAllFake ← false
21:             Break                          *** real mismatch ***
22:         end if
23:     end for
24:     if flagAllFake then                    *** all fake mismatches ***
25:         Add i + 1 to Occ                   *** exact occurrence found ***
26:     end if
27: end for
28: return Occ
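
The three stages of Algorithm 1 can be put together in a short Python
sketch and cross-checked against a brute-force matcher. The sketch makes
two simplifying assumptions that are not part of the paper: the longest
common extension is computed by direct comparison rather than by LCA
queries on a suffix tree, so its running time is not O(kn) in general, and
Unicode Private Use Area characters are assumed to be absent from Σ and
are used as the symbols λ_i. The names conservative_match and brute_force
are hypothetical.

```python
from typing import List, Sequence, Set


def lce(a: str, i: int, b: str, j: int) -> int:
    """Naive longest common extension of a[i:] and b[j:] (stand-in for an LCA query)."""
    q = 0
    while i + q < len(a) and j + q < len(b) and a[i + q] == b[j + q]:
        q += 1
    return q


def conservative_match(pattern: Sequence[Set[str]], text: str) -> List[int]:
    """Return the 1-based positions of text at which the degenerate pattern occurs."""
    m, n = len(pattern), len(text)

    # Stage 1 (Substitute): build the solid string P_lambda and the lambda positions.
    solid: List[str] = []
    lam_positions: List[int] = []
    for pos, symbol in enumerate(pattern, start=1):
        if len(symbol) == 1:
            solid.append(next(iter(symbol)))
        else:
            solid.append(chr(0xE000 + len(lam_positions)))  # assumed not in the alphabet
            lam_positions.append(pos)
    p_lam = "".join(solid)
    k = len(lam_positions)

    # Stage 2 (Approximate Pattern Search): k + 1 jumps from mismatch to mismatch.
    approximate_match: List[int] = []
    for i in range(n - m + 1):
        f = 0
        for _ in range(k + 1):
            if f > m:  # end of pattern already reached
                break
            f = f + lce(text, i + f, p_lam, f) + 1
        if f == m + 1:  # exactly k mismatches, all at lambda positions
            approximate_match.append(i)

    # Stage 3 (Filter): keep the shifts whose k mismatches are all fake.
    occ: List[int] = []
    for i in approximate_match:
        if all(text[i + e - 1] in pattern[e - 1] for e in lam_positions):
            occ.append(i + 1)
    return occ


def brute_force(pattern: Sequence[Set[str]], text: str) -> List[int]:
    """Reference matcher that compares every position directly."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1)
            if all(text[i + e] in pattern[e] for e in range(m))]
```

On Example 1, both functions return the positions 2 and 5. Replacing lce
by a constant-time query on a suffix tree of TP_λ# is what would be needed
to approach the bound stated in Theorem 4.1.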

Corollary. Given degenerate strings P̃ and T̃ of total length n containing k
non-solid symbols in total, one can compute the occurrences of P̃ in T̃ in
O(nk) time.


5 Conclusion 

In this paper, we studied the matching problem of conservative degenerate
strings and presented an efficient algorithm that can find, for given
degenerate strings P̃ and T̃ of total length n containing k non-solid
symbols in total, the occurrences of P̃ in T̃ in O(nk) time, i.e. linear in
the size of the input for constant k. In particular, we used the novel
technique of substituting the non-solid symbols in the given degenerate
strings with unique solid symbols, which let us make use of the efficient
approximate pattern search solution for solid strings to get an efficient
solution for degenerate strings. It would be interesting to see how well
the presented algorithm behaves in practice and to apply it to a wide
range of problems such as the computation of prefix/border arrays, suffix
trees, covers, repetitions, seeds and decompositions.